Applying Relation Extraction and Graph Matching to Answering Multiple Choice Questions
Shimoda, Naoki, Yamamoto, Akihiro
In this research, we combine Transformer-based relation extraction with knowledge graph (KG) matching and apply them to answering multiple-choice questions (MCQs) while maintaining the traceability of the output process. KGs are structured representations of factual knowledge consisting of entities and relations. Because of their high construction cost, they have long been regarded as static databases of validated links. However, recent Transformer-based relation extraction (RE) methods can generate KGs dynamically from natural language text, opening the possibility of representing the meaning of input sentences with the created KGs. Exploiting this capability, we propose a method that answers MCQs in the "fill-in-the-blank" format, accounting for the fact that RE methods produce KGs encoding false information when given factually incorrect text. We measure the truthfulness of each question sentence by (i) converting the sentence into a relational graph using an RE method and (ii) verifying that graph against factually correct KGs under the closed-world assumption. Experimental results show that our method correctly answers up to around 70% of the questions while keeping the procedure traceable, and that the question category strongly influences accuracy.
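Below is a minimal sketch of the verification step the abstract describes, assuming triples of the form (head, relation, tail); the toy reference KG, the truthfulness score (fraction of extracted triples found in the KG), and the option-selection rule are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: verify extracted triples against a reference KG
# under the closed-world assumption (a triple absent from the KG is false).

Triple = tuple[str, str, str]  # (head, relation, tail)

# Toy reference KG of validated facts (illustrative only).
REFERENCE_KG: set[Triple] = {
    ("Kyoto", "locatedIn", "Japan"),
    ("Tokyo", "capitalOf", "Japan"),
}

def truthfulness(extracted: set[Triple], kg: set[Triple]) -> float:
    """Fraction of extracted triples supported by the KG (0 if none extracted)."""
    if not extracted:
        return 0.0
    return len(extracted & kg) / len(extracted)

def answer_mcq(options: dict[str, set[Triple]], kg: set[Triple]) -> str:
    """Pick the option whose filled-in sentence yields the most truthful graph."""
    return max(options, key=lambda o: truthfulness(options[o], kg))

# Each option stands for the triples an RE model would extract from the
# question sentence with that option filled into the blank.
options = {
    "A": {("Kyoto", "locatedIn", "Japan")},   # supported by the KG
    "B": {("Kyoto", "capitalOf", "Japan")},   # absent, hence false
}
print(answer_mcq(options, REFERENCE_KG))  # -> "A"
```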
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.05)
- North America > United States > Hawaii (0.04)
- (5 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Education (0.86)
Can Large Language Models Express Uncertainty Like Human?
Tao, Linwei, Yeh, Yi-Fan, Kai, Bo, Dong, Minjing, Huang, Tao, Lamb, Tom A., Yu, Jialin, Torr, Philip H. S., Xu, Chang
Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we 1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and 2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we 3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we 4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction. The code and dataset are anonymously available at https://anonymous.

Large language models (LLMs) are increasingly deployed in real-world applications, from education and healthcare to law and scientific discovery. While their capabilities make them powerful assistants, LLMs are also prone to hallucinations and factual errors, and human overreliance on their outputs can lead to serious consequences. For instance, a U.S. lawyer once submitted fabricated cases generated by ChatGPT, resulting in professional sanctions (ABC News, 2023). Recent social experiments demonstrate that people adjust their reliance on AI depending on how confident the model appears: reliable expressions of uncertainty can enhance trust, satisfaction, and task accuracy (Kim et al., 2024; Xu et al., 2025). These findings highlight the importance of attaching reliable uncertainty estimates to LLM responses to support human decision-making. Ultimately, the conveyance of confidence plays a central role in shaping trust and guiding human-AI interaction. A growing body of work explores the extraction and representation of confidence in LLM outputs. Logit-based methods are simple and inexpensive but require access to model logits, which are typically unavailable in commercial LLM APIs. Verbalized numerical scores avoid that requirement, yet such scores rarely align with common user behavior or natural communication, as users do not typically phrase queries with explicit instructions like "Please output your confidence along with the answer."
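As a rough illustration of the kind of hedge-to-confidence mapper the abstract describes, the sketch below looks up hedging phrases in a score table; the phrase list and numbers are invented for illustration, and the paper's human-annotated dataset would supply real values.

```python
# Hypothetical sketch of a linguistic-confidence mapper: detect a hedging
# phrase in a model response and map it to a numeric confidence score.
# The phrase->score table is invented; real scores would come from the
# paper's human-annotated hedging dataset.
import re

HEDGE_SCORES = {
    "definitely": 0.95,
    "almost certainly": 0.90,
    "probably": 0.70,
    "might": 0.40,
    "unlikely": 0.20,
}

def linguistic_confidence(response: str, default: float = 0.5) -> float:
    """Return the score of the most confident hedge found, else a neutral default."""
    found = [
        score
        for phrase, score in HEDGE_SCORES.items()
        if re.search(rf"\b{re.escape(phrase)}\b", response, re.IGNORECASE)
    ]
    return max(found) if found else default

print(linguistic_confidence("The answer is probably Paris."))  # -> 0.7
```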
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
- (4 more...)
- Government > Regional Government > North America Government > United States Government (0.70)
- Media > Television (0.47)
SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge
Haas, Lukas, Yona, Gal, D'Antonio, Giovanni, Goldshtein, Sasha, Das, Dipanjan
We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
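For context on the headline metric, here is a small sketch of an F1 computation in the style used for SimpleQA-like benchmarks, assuming each answer is graded correct, incorrect, or not attempted, and that F1 is the harmonic mean of overall accuracy and accuracy on attempted questions; the grade counts are made up.

```python
# Sketch of a SimpleQA-style F1 over graded answers (assumed definition:
# harmonic mean of overall accuracy and accuracy given attempted).

def f1_score(grades: list[str]) -> float:
    n = len(grades)
    correct = grades.count("correct")
    attempted = n - grades.count("not_attempted")
    overall = correct / n if n else 0.0
    given_attempted = correct / attempted if attempted else 0.0
    if overall + given_attempted == 0:
        return 0.0
    return 2 * overall * given_attempted / (overall + given_attempted)

# Toy grade list: 2 correct, 1 incorrect, 1 abstention.
grades = ["correct", "correct", "incorrect", "not_attempted"]
print(round(f1_score(grades) * 100, 1))  # -> 57.1
```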
- North America > United States > California > San Francisco County > San Francisco (0.14)
- South America > Colombia (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- (7 more...)
- Leisure & Entertainment (1.00)
- Government (0.69)
- Media > Television (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)
Too Consistent to Detect: A Study of Self-Consistent Errors in LLMs
Tan, Hexiang, Sun, Fei, Liu, Sha, Su, Du, Cao, Qi, Chen, Xin, Wang, Jingang, Cai, Xunliang, Wang, Yuanzhuo, Shen, Huawei, Cheng, Xueqi
As large language models (LLMs) often generate plausible but incorrect content, error detection has become increasingly critical to ensure truthfulness. However, existing detection methods often overlook a critical problem we term self-consistent errors, where an LLM repeatedly generates the same incorrect response across multiple stochastic samples. This work formally defines self-consistent errors and evaluates mainstream detection methods on them. Our investigation reveals two key findings: (1) Unlike inconsistent errors, whose frequency diminishes significantly as LLM scale increases, the frequency of self-consistent errors remains stable or even increases. (2) All four types of detection methods we evaluate struggle significantly to detect self-consistent errors. These findings reveal critical limitations in current detection methods and underscore the need for improvement. Motivated by the observation that self-consistent errors often differ across LLMs, we propose a simple but effective cross-model probe method that fuses hidden-state evidence from an external verifier LLM. Our method significantly enhances performance on self-consistent errors across three LLM families.
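A minimal sketch of what a cross-model probe could look like, assuming the fusion is a simple concatenation of hidden states followed by a linear probe; the random features, dimensions, and labels below are stand-ins, not the paper's actual setup.

```python
# Hypothetical sketch of a cross-model probe: concatenate hidden states from
# the generator LLM and an external verifier LLM, then train a linear probe
# to flag errors. The hidden states here are random stand-ins; in practice
# they would come from the two models' forward passes on the same response.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_gen, d_ver = 200, 64, 64

h_generator = rng.normal(size=(n, d_gen))   # generator hidden states
h_verifier = rng.normal(size=(n, d_ver))    # verifier hidden states
labels = rng.integers(0, 2, size=n)         # 1 = response is an error

features = np.concatenate([h_generator, h_verifier], axis=1)  # fusion step
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```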
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (4 more...)
American citizen killed in Russian attack on Kyiv, State Department confirms
An American citizen was among 15 people killed in Russian drone and missile strikes on the Ukrainian capital, Kyiv, on Tuesday, State Department spokesperson Tammy Bruce confirmed at a press conference Wednesday. In response to a reporter's question about U.S. diplomats in Kyiv having to spend the night in a bunker, Bruce said, "we can confirm the death of a U.S. citizen in Ukraine." "We are aware of last night's attack on Kyiv that resulted in numerous casualties, including the tragic death of a U.S. citizen," she said, adding, "We condemn those strikes and extend our deepest condolences to the victims and to the families of all those affected." Bruce did not offer further details on the identity of the citizen killed by the Russian strikes, citing "respect to the family during this obviously horrible time."
- Europe > Ukraine > Kyiv Oblast > Kyiv (1.00)
- Asia > Russia (1.00)
- North America > Canada (0.34)
- (6 more...)
Tuning LLM Judge Design Decisions for 1/1000 of the Cost
Salinas, David, Swelam, Omar, Hutter, Frank
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed: they compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present across papers; for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we leverage multi-objective, multi-fidelity optimization, which finds judges that trade off accuracy against cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also rely on open-weight models, ensuring greater accessibility and reproducibility.
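The sketch below illustrates one way such a multi-objective, multi-fidelity search could be organized: cheap low-fidelity screening, promotion of survivors to a full evaluation, and a Pareto front over accuracy and cost. The mock evaluate function, fidelity sizes, and configuration grid are assumptions for illustration, not the paper's tuner.

```python
# Hypothetical sketch of multi-objective, multi-fidelity tuning of an LLM
# judge. evaluate() is a mock; a real run would measure the judge's
# agreement with human annotations and its API cost.
import itertools, random

random.seed(0)

def evaluate(config: dict, n_examples: int) -> tuple[float, float]:
    """Mock judge evaluation: returns (agreement with humans, dollar cost)."""
    accuracy = random.uniform(0.5, 0.9)
    cost = n_examples * config["price_per_example"]
    return accuracy, cost

configs = [
    {"model": m, "prompt": p, "price_per_example": c}
    for m, p, c in itertools.product(
        ["judge-small", "judge-large"], ["pairwise", "rubric"], [0.001, 0.01]
    )
]

# Low fidelity: cheap screening on 50 examples; keep the top half.
screened = sorted(configs, key=lambda c: evaluate(c, 50)[0], reverse=True)
survivors = screened[: len(screened) // 2]

# High fidelity: full evaluation of the survivors on 1000 examples.
results = [(c, *evaluate(c, 1000)) for c in survivors]

# Pareto front: keep configs not dominated on both accuracy and cost.
front = [
    (c, acc, cost)
    for c, acc, cost in results
    if not any(a2 >= acc and c2 <= cost and (a2, c2) != (acc, cost)
               for _, a2, c2 in results)
]
for c, acc, cost in front:
    print(c["model"], c["prompt"], f"acc={acc:.2f}", f"cost=${cost:.2f}")
```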
- Europe > Germany > Baden-Württemberg > Freiburg (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (2 more...)
5 likely choices for who really ran the disastrous Biden White House
For years, conservative media, lawmakers and talking heads have been sounding the alarm about President Joe Biden's cognitive free fall. And for years, left-wing media, lawmakers and their loyal mouthpieces waved it off with the same condescending dismissal -- accusing us of lying, fear-mongering or worse. Some even went so far as to say they couldn't keep up with Biden's supposed brilliance and jam-packed schedule of what was mostly just one morning briefing and two mid-afternoon naps. Now that Biden has shuffled out of office, left-wing media seems to be waking up to the glaringly obvious. The New York Times of all places -- yes, the same paper that acted as Biden's PR firm -- has revealed that he relied on teleprompters during intimate fundraisers in private homes.
- North America > United States (1.00)
- Asia > Afghanistan (0.05)
- Europe (0.05)
- (2 more...)
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (0.96)
Harris' 'ice princess' demeanor, Bush's belly-tap were key expressions at Jimmy Carter's funeral: expert
Presidents Clinton, George W. Bush, Obama, Biden and Trump all paid their respects to Jimmy Carter at his state funeral in Washington, D.C. During the 2024 campaign cycle, Americans witnessed what appeared to be no love lost between President-elect Donald Trump and former President Barack Obama. At former President Jimmy Carter's funeral, however, the two appeared to be enjoying each other's company and largely ignored other dignitaries arriving around them, including Vice President Kamala Harris and President Biden. Susan Constantine, a communication and body language expert, said Harris came off "as cool as could be," adding, "When she was walking she was very robotic."
- North America > United States > District of Columbia > Washington (0.25)
- North America > United States > New York (0.06)
- North America > United States > Pennsylvania (0.05)
Characteristics of Political Misinformation Over the Past Decade
Although misinformation tends to spread online, it can have serious real-world consequences. To develop automated tools that detect and mitigate the impact of misinformation, researchers must leverage algorithms that can adapt to the modality (text, images, and video), the source, and the content of the false information. However, these characteristics tend to change dynamically over time, making it challenging to develop robust algorithms to fight misinformation spread. This paper therefore uses natural language processing to find common characteristics of political misinformation over a twelve-year period. The results show that misinformation has increased dramatically in recent years and is increasingly shared from sources whose primary modalities are text and images (e.g., Facebook and Instagram), although video-sharing sources containing misinformation are starting to increase (e.g., TikTok). Moreover, statements expressing misinformation were found to contain more negative sentiment than accurate information. However, the sentiment associated with both accurate and inaccurate information has trended downward, indicating a generally more negative tone in political statements over time. Finally, recurring misinformation categories were uncovered that persist across years, which may imply that people tend to share inaccurate statements about information they fear or don't understand (Science and Medicine, Crime, Religion), information that impacts them directly (Policy, Election Integrity, Economy), or public figures who are salient in their daily lives. Together, it is hoped that these insights will assist researchers in developing algorithms that are temporally invariant and capable of detecting and mitigating misinformation across time.
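To make the trend analysis concrete, here is a toy sketch of computing mean sentiment per year for accurate versus inaccurate statements; the lexicon-based scorer and the records are invented stand-ins for the fact-checked corpus and the sentiment model a real study would use.

```python
# Toy sketch: mean sentiment per (year, accuracy) bucket over a statement
# corpus. The word lists and records below are invented for illustration.
from collections import defaultdict

NEGATIVE = {"disaster", "fraud", "fear", "crisis"}
POSITIVE = {"growth", "safe", "success", "honest"}

def sentiment(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

records = [  # (year, is_accurate, statement text)
    (2013, False, "election fraud crisis looms"),
    (2013, True, "economy shows steady growth"),
    (2024, False, "total disaster and fraud everywhere"),
    (2024, True, "jobs report shows growth despite fear"),
]

buckets = defaultdict(list)
for year, accurate, text in records:
    buckets[(year, accurate)].append(sentiment(text))

for (year, accurate), scores in sorted(buckets.items()):
    label = "accurate" if accurate else "misinformation"
    print(year, label, sum(scores) / len(scores))
```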
- North America > United States > New York (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- North America > United States > Massachusetts (0.04)
- (4 more...)
- Media > News (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
Measuring short-form factuality in large language models
Wei, Jason, Nguyen, Karina, Chung, Hyung Won, Jiao, Yunxin Joy, Papay, Spencer, Glaese, Amelia, Schulman, John, Fedus, William
An open problem in artificial intelligence is how to train language models that produce responses that are factually correct. Current frontier models sometimes produce false outputs or answers that are not substantiated by evidence, a problem known as "hallucinations." Such hallucinations are one of the major barriers to broader adoption of general forms of artificial intelligence like large language models. Factuality is a complicated topic because it is hard to measure: evaluating the factuality of an arbitrary claim can be challenging, and language models often generate long completions that contain dozens of factual claims. In this work, we sidestep the open-endedness of language models by considering only short, fact-seeking questions with a single answer. This reduction of scope is important because it makes measuring factuality much more tractable, albeit at the cost of leaving open research questions such as whether improved behavior on short-form factuality generalizes to long-form factuality. We present a benchmark called SimpleQA, which contains 4,326 short, fact-seeking questions. SimpleQA was designed with a few important properties in mind, the first being high correctness: reference answers are determined by two independent AI trainers, and questions were written so that predicted answers are easily gradable.
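As a simplified stand-in for the grading contract described above (the actual benchmark uses a prompted LLM autorater), the sketch below grades a predicted answer against the reference as correct, incorrect, or not attempted; the string-normalization and matching rule are assumptions for illustration.

```python
# Simplified stand-in for SimpleQA-style grading. A real autorater is an
# LLM prompted with the question, reference answer, and prediction; here a
# string match approximates the same three-way grading contract.
import string

def normalize(text: str) -> str:
    """Lowercase and strip punctuation/whitespace for lenient matching."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

def grade(predicted: str, reference: str) -> str:
    if not predicted.strip():
        return "not_attempted"
    return "correct" if normalize(reference) in normalize(predicted) else "incorrect"

print(grade("It was Marie Curie.", "Marie Curie"))  # -> "correct"
print(grade("", "Marie Curie"))                     # -> "not_attempted"
```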
- North America > United States > California > San Francisco County > San Francisco (0.14)
- South America > Argentina (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > Netherlands (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)